31 research outputs found
Multi-sense Embeddings through a Word Sense Disambiguation Process
Natural Language Understanding has seen an increasing number of publications in recent years, especially after robust word embedding models became popular. These models gained a special place in the spotlight when they proved able to capture and represent semantic relations in huge amounts of data. Nevertheless, traditional models often fall short on intrinsic linguistic issues, such as polysemy and homonymy. Multi-sense word embeddings were devised to alleviate these and other problems by representing each word sense separately, but studies in this area are still in their infancy and much remains to be explored. We follow this line by proposing an unsupervised technique that disambiguates and annotates words by their specific sense, taking the influence of their context into account. The annotated words are later used to train a word embedding model that produces a more accurate vector representation. We test our approach on 6 different benchmarks for the word similarity task, showing that it sustains good results and often outperforms current state-of-the-art systems.
https://deepblue.lib.umich.edu/bitstream/2027.42/145475/3/tacl.pdf
Description of tacl.pdf : Working Paper
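The pipeline described above — disambiguate each word against its context, then train embeddings on the sense-annotated text — can be illustrated with a minimal sketch. This is not the paper's MSSA algorithm: it uses a simplified Lesk-style gloss-overlap heuristic, and the tiny sense inventory below is a hypothetical stand-in for a lexical database such as WordNet.

```python
# Toy sense inventory: each word maps to sense keys with a short gloss.
# Hypothetical data for illustration only.
TOY_SENSES = {
    "bank": {
        "bank.n.01": "financial institution that accepts deposits and lends money",
        "bank.n.02": "sloping land beside a body of water such as a river",
    },
    "plant": {
        "plant.n.01": "living organism that grows in soil and performs photosynthesis",
        "plant.n.02": "industrial factory where goods are manufactured",
    },
}

def disambiguate(word, context_words):
    """Pick the sense whose gloss shares the most words with the context."""
    senses = TOY_SENSES.get(word)
    if not senses:
        return word  # unknown word: leave unannotated
    context = set(context_words)
    return max(senses, key=lambda s: len(context & set(senses[s].split())))

def annotate(tokens):
    """Replace each token with its disambiguated sense key, when available."""
    return [disambiguate(t, [w for w in tokens if w != t]) for t in tokens]

print(annotate("the boat reached the bank of the river".split()))
```

The sense-annotated tokens produced this way could then be fed to any standard embedding trainer, yielding one vector per word sense rather than one per surface form.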
Paraphrase Types for Generation and Detection
Current approaches in paraphrase generation and detection rely heavily on a single general similarity score, ignoring the intricate linguistic properties of language. This paper introduces two new tasks to address this shortcoming by considering paraphrase types: specific linguistic perturbations at particular text positions. We name these tasks Paraphrase Type Generation and Paraphrase Type Detection. Our results suggest that while current techniques perform well in a binary classification scenario, i.e., paraphrased or not, the inclusion of fine-grained paraphrase types poses a significant challenge. While most approaches are good at generating and detecting generally semantically similar content, they fail to understand the intrinsic linguistic variables they manipulate. Models trained to generate and identify paraphrase types also show improvements on tasks without them. In addition, scaling these models further improves their ability to understand paraphrase types. We believe paraphrase types can unlock a new paradigm for developing paraphrase models and solving tasks in the future.
Comment: Published at EMNLP 2023
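A paraphrase type, as defined above, pairs a linguistic perturbation label with the text position it affects. One plausible way to represent such an annotation is sketched below; the type label and field names are hypothetical illustrations, not the paper's taxonomy.

```python
from dataclasses import dataclass

@dataclass
class ParaphraseAnnotation:
    """One paraphrase type: a linguistic perturbation at a text position."""
    ptype: str       # perturbation label, e.g. "synonym substitution" (hypothetical)
    start: int       # character offset in the original where the change begins
    end: int         # character offset where it ends
    original: str    # original sentence
    paraphrase: str  # paraphrased sentence

# "big" (chars 2-5 of the original) is replaced by "large".
ann = ParaphraseAnnotation("synonym substitution", 2, 5,
                           "a big house", "a large house")
print(ann.original[ann.start:ann.end])
```

Under this representation, Paraphrase Type Generation would produce the `paraphrase` field given the original and a target `ptype`, while Paraphrase Type Detection would recover `ptype` and the offsets given the sentence pair.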
Semantic Feature Extraction Using Multi-Sense Embeddings and Lexical Chains
The relationships between words in a sentence often tell us more about the underlying semantic content of a document than its individual words. In the last few years, natural language understanding has seen an increasing effort in the development of techniques that produce non-trivial features, especially after robust word embedding models became prominent, when they proved able to capture and represent semantic relationships from massive amounts of data. These new dense vector representations indeed raise the baseline in natural language processing, but they still fall short in dealing with intrinsic linguistic issues, such as polysemy and homonymy. Systems that rely on natural language at their core can be affected by a weak semantic representation of human language, producing inaccurate outcomes based on poor decisions.
On this subject, word sense disambiguation and lexical chains have explored alternatives to alleviate several problems in linguistics, such as semantic representation, definition, differentiation, polysemy, and homonymy. However, little effort has gone into combining recent advances in token embeddings (e.g., of words or documents) with word sense disambiguation and lexical chains. To help build a bridge between these areas, this work proposes as its main contributions a collection of algorithms to extract semantic features from large corpora, named MSSA, MSSA-D, MSSA-NR, FLLC II, and FXLC II. The MSSA techniques focus on disambiguating and annotating each word by its specific sense, considering the semantic effects of its context. The lexical chains group derives the semantic relations between consecutive words in a document in dynamic and pre-defined manners. These techniques aim to uncover the implicit semantic links between words through their lexical structure, incorporating multi-sense embeddings, word sense disambiguation, lexical chains, and lexical databases.
A few natural language problems are selected to validate the contributions of this work, on which our techniques outperform state-of-the-art systems. All the proposed algorithms can be used separately as independent components or combined in a single system to improve the semantic representation of words, sentences, and documents. Additionally, they can also work in a recurrent form, refining their results even further.
Ph.D.
College of Engineering & Computer Science, University of Michigan-Dearborn
https://deepblue.lib.umich.edu/bitstream/2027.42/149647/1/Terry Ruas Final Dissertation.pdf
Description of Terry Ruas Final Dissertation.pdf : Dissertation
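The core idea behind lexical chains — grouping consecutive words that are semantically related — can be sketched in a few lines. This is not the dissertation's FLLC II or FXLC II algorithm: the toy category map below is a hypothetical stand-in for a lexical database, and two adjacent words join the same chain when their category sets intersect.

```python
# Hypothetical word-to-category map standing in for a lexical database.
TOY_CATEGORIES = {
    "car": {"vehicle"}, "truck": {"vehicle"},
    "engine": {"vehicle", "machine"},
    "apple": {"fruit"}, "pear": {"fruit"},
}

def lexical_chains(tokens):
    """Greedily group consecutive tokens whose category sets intersect."""
    chains = []
    for tok in tokens:
        cats = TOY_CATEGORIES.get(tok, set())
        if chains and chains[-1]["cats"] & cats:
            chains[-1]["words"].append(tok)
            chains[-1]["cats"] &= cats  # narrow the chain to the shared meaning
        else:
            chains.append({"words": [tok], "cats": set(cats)})
    return [c["words"] for c in chains]

print(lexical_chains(["car", "truck", "engine", "apple", "pear"]))
```

Each resulting chain can then be treated as a single semantic unit — e.g., replaced by a representative sense before embedding — which is the spirit in which the dissertation combines lexical chains with multi-sense embeddings.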
Exploring and Expanding the Use of Lexical Chains in Information Retrieval
This technical report explains our advances in exploring lexical chain construction using WordNet and proposes algorithms for different types of structures.
https://deepblue.lib.umich.edu/bitstream/2027.42/136659/1/LexicalChainsReport.pdf-1
Description of LexicalChainsReport.pdf : Technical report
Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection
The rise of language models such as BERT allows for high-quality text paraphrasing. This poses a problem for academic integrity, as it becomes difficult to differentiate between original and machine-generated content. We propose a benchmark consisting of articles paraphrased by recent language models based on the Transformer architecture. Our contribution fosters future research on paraphrase detection systems: it offers a large collection of aligned original and paraphrased documents, a study of their structure, and classification experiments with state-of-the-art systems, and we make our findings publicly available.
How Large Language Models are Transforming Machine-Paraphrased Plagiarism
The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive transformers in generating machine-paraphrased plagiarism and their detection is still underexplored in the literature. This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software and perform a human study with 105 participants regarding their detection performance and the quality of the generated examples. Our results suggest that large models can rewrite text that humans have difficulty identifying as machine-paraphrased (53% mean accuracy). Human experts rate the quality of paraphrases generated by GPT-3 as high as that of original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3) achieves a 66% F1-score in detecting paraphrases.
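One reason machine paraphrases are hard to catch is that naive lexical-overlap detectors fail when wording changes while meaning is preserved. The sketch below is a deliberately simple word-overlap baseline, not any of the detectors evaluated in the study; the threshold is an arbitrary illustration.

```python
def jaccard(a, b):
    """Jaccard similarity over lowercase word sets of two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def flags_as_copy(source, candidate, threshold=0.5):
    """Flag a candidate only when its wording overlaps heavily with the source."""
    return jaccard(source, candidate) >= threshold

# A near-verbatim copy is flagged; a reworded paraphrase slips through,
# which is why the study turns to neural detectors instead.
print(flags_as_copy("the car is fast", "the car is quick"))
print(flags_as_copy("the car is fast", "the vehicle moves rapidly"))
```

This gap between surface overlap and preserved meaning is exactly what makes machine-paraphrased plagiarism difficult for both humans and simple automated checks.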
Analyzing Multi-Task Learning for Abstractive Text Summarization
Despite the recent success of multi-task learning and pre-finetuning for natural language understanding, few works have studied the effects of task families on abstractive text summarization. Task families are a form of task grouping during the pre-finetuning stage to learn common skills, such as reading comprehension. To close this gap, we analyze the influence of multi-task learning strategies using task families for the English abstractive text summarization task. We group tasks into one of three strategies, i.e., sequential, simultaneous, and continual multi-task learning, and evaluate trained models through two downstream tasks. We find that certain combinations of task families (e.g., advanced reading comprehension and natural language inference) positively impact downstream performance. Further, we find that the choice and combination of task families influence downstream performance more than the training scheme, supporting the use of task families for abstractive text summarization.
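The difference between the sequential and simultaneous training schemes above comes down to the order in which batches are drawn from the task families. A minimal sketch, with hypothetical task and family names standing in for the actual pre-finetuning tasks:

```python
import random

# Hypothetical task families; real families group tasks by shared skill.
TASK_FAMILIES = {
    "reading_comprehension": ["squad_like", "race_like"],
    "natural_language_inference": ["mnli_like", "rte_like"],
}

def simultaneous_order(steps, seed=0):
    """Simultaneous scheme: each training step samples any task at random."""
    rng = random.Random(seed)
    all_tasks = [t for fam in TASK_FAMILIES.values() for t in fam]
    return [rng.choice(all_tasks) for _ in range(steps)]

def sequential_order(steps_per_task):
    """Sequential scheme: exhaust each task, family by family, in turn."""
    return [t for fam in TASK_FAMILIES.values()
              for t in fam
              for _ in range(steps_per_task)]
```

Continual multi-task learning would sit between the two, revisiting earlier families while new ones are introduced; the paper's finding is that which families appear matters more than which of these orderings is used.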